Loan Data Analysis by Eliu Schmitt (mail@eliu.de), Dec. 2016

The data from the peer to peer lending platform Prosper contains 113.937 entries with 81 variables.

The entries represent 113.066 unique credit listings pulled from the platform between November 2005 and March 2014. Double entries were removed from the dataset.

The 81 different variables contain information on the loan, the borrower and the lenders/investors.

Univariate Plots Section

About the Loan

Several variables contain information on the loan.

LoanOriginalAmount

LoanOriginalAmount: The origination amount of the loan.

The amount of the loans range from 1,000 to 35,000 USD. This are probably the maximum and minimum amounts specified by Prosper. The distribution is negatively skewed with very distinct modes at about 4,000, 7,500, 10,000, 15,000, 20,000 and 25,000 USD. People tend to round to these numbers even while the amount they need the loan for are probably not structured like this.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6300    8315   12000   35000

MonthlyLoanPayment

MonthlyLoanPayment: The scheduled monthly loan payment.

The monthly payment seems to be rather low around the 200 $. There is a very high count of loans with monthly payments of about 170 USD that should be investigated. The distributions seems to be multimodal with local peaks at 100, 170, 300, 500 and 700 USD.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.0   217.4   271.9   370.6  2252.0

LoanStatus

The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

The relative low number of cancelled loans is of no further interest and was removed from the dataset.

The amount of written off (‘Chargedoff’) and defaulted credits compared to the completed are unexpectedly high. About 30 % seem to be not able to pay back the loan.

The following chart shows a more simplified view to compare the successful and defaulted loans, counting “Defaulted” and “Chargedoff” as “Not Successful”, “Completed” as “Successful” and all other as “Open”.

The plot shows a failure rate of about 30 % of all loans (Success rate overall: 0.691).

## # A tibble: 3 × 3
##   LoanStatus.Bucket     n     ratio
##              <fctr> <int>     <dbl>
## 1    Not successful 17010 0.1504498
## 2              Open 57990 0.5129090
## 3        successful 38061 0.3366413
## # A tibble: 2 × 3
##   LoanStatus.Bucket     n    ratio
##              <fctr> <int>    <dbl>
## 1    Not successful 17010 0.308874
## 2        successful 38061 0.691126

Term

Term: The length of the loan expressed in months.

The absence of 2 year and 4 year loans is somewhat surprising. The majority were 3 year loans.

## 
##    36    60    12 
## 87219 24228  1614

BorrowerRate and LenderYield

Borrower Rate: The Borrower’s interest rate for this loan. LenderYield: The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.

The LenderYield is a bit lower than the BorrowerRate but follows the same distribution. The distributions seems to be roughly normal except for a distinct peak at ca. 33 %.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1929  0.2506  0.4975
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1245  0.1740  0.1829  0.2406  0.4925

ProsperScore

ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.
Applicable for loans originated after July 2009.

I treated the Score as a category rather than a continuous value since it is not known how the risk is scaled. It is also not known if 11 is a better score than 10 or if 11 marks loans without a score. Most risk scores fall in the middle range (4 to 8).

Information about the borrower

ListingCategory

The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

The borrowers mostly specified “Debt Consolidation” as their intended use what is reasonable. Quite some did not specify the purpose. Other notable uses were “Home Improvement” and “Business”.

Occupation

The Occupation selected by the Borrower at the time they created the listing.

The factor includes 68 unique options. The biggest group was “Other” and another group left the field blank. There is a good chance this was to protect their privacy. If the data is true, some of the occupations are surprising: I would guess that Doctors, Judges, Dentist or other respectable professions would have other - less expensive options - to fund them self.

EmploymentStatus

The employment status of the borrower at the time they posted the listing.

A simplified shows that the unemployed are a minority.

EmploymentStatusDuration

The length in months of the employment status at the time the listing was created.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.06  137.00  755.00    7621

IsBorrowerHomeowner

A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.

About every second borrower is a home owner. This might raise the question why they wont use the house to get a bank loan on the house.

CurrentlyInGroup / GroupKey

Specifies whether or not the Borrower was in a group at the time the listing was created.

The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.

Only 12,712 borrowers were in one of 707 distinct groups.

##  False   True 
## 100349  12712
## [1] 707

CurrentDelinquencies

Number of accounts delinquent at the time the credit profile was pulled.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.595   0.000  83.000     696

AmountDelinquent

Dollars delinquent at the time the credit profile was pulled.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0      0.0      0.0    986.8      0.0 463900.0     7618

IncomeRage

IncomeRange The income range of the borrower at the time the listing was created.

Most borrowers specified an income range. The majority had an annual income between 25,000 USD and 100,000 USD with quite some incomes above 100,000 USD. This was unexpected, since these borrowers would probably also have access to cheaper loan from other sources.

IncomeVerifiable

IncomeVerifiable The borrower indicated they have the required documentation to support their income.

Most borrowers marked their income as verifiable.

##  False   True 
##   8586 104475

StatedMonthlyIncome

StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.

The stated monthly income ranges from 0 to nearly 2 Mil. 75 % stated an income of at leas 3,200 USD per month. There are several very distinct peaks in the histogram maybe indication some sort of clusters.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5605    6825 1750000

About the Investors

Recommendations

Recommendations: Number of recommendations the borrower had at the time the listing was created.

Only 4,257 borrowers had Recommendations at all. Maybe the feature wasn’t used that much.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.04838  0.00000 39.00000
## [1] "Without recommendations"
## [1] 108804
## [1] "With recommendations"
## [1] 4257

InvestmentFromFriendsCount

InvestmentFromFriendsCount: Number of friends that made an investment in the loan.

Only 2,131 borrowers hand friends investing in them through the platform. This might be due to the reason that Prosper takes a few and friends can just borrow money directly skipping the fee.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.02364  0.00000 33.00000
## [1] 2131

Investors

Investors The number of investors that funded the loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.88  116.00 1189.00

Univariate Analysis

What is the structure of your dataset?

There are 113,937 unique loans from the peer to peer lending platform Prosper. The data contains 81 different features about the loan, the borrower and the investors/lenders.

I analyzed the following features:

  • LoanOriginalAmount Most loans were in a lower range. The median is 6,300 USD and the mean 8,135 USD.

  • MonthlyLoanPayment Most loan payments are below 370 USD per month (3rd quartile). There is an unexpected spike at 170 USD per month.

  • Loan Status Factor that shows if the loan is current, if it was payed back or if the borrower did not pay it pack.

  • Term The length of the loan expressed in months.

  • BorrowerRate
  • LenderYield

Interest for the borrower and the lender per annum. The max is with nearly 50 % very height. The mean and median are with about 18.4 % and 19.3 % are also very high.

  • Prosper Score A custom risk score built using historical Prosper data. Used as a factor.

  • ListingCategory The category of the listing that the borrower selected when posting their listing.

Many borrowers did not specify a category. Those who did mostly choose Debt Consolidation.

  • Occupation The Occupation selected by the Borrower at the time they created the listing.

The most often chosen occupation was “Other”.

  • EmploymentStatus The employment status of the borrower at the time they posted the listing.

  • EmploymentStatusDuration The length in months of the employment status at the time the listing was created.

  • IsBorrowerHomeowner About half of the borrowers specified they were homeowners.

  • CurrentlyInGroup
  • GroupKey Specifies whether or not the Borrower was in a group at the time the listing was created.

The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation

Only few people were in a group.

  • CurrentDelinquencies Number of accounts delinquent at the time the credit profile was pulled.

  • AmountDelinquent Dollars delinquent at the time the credit profile was pulled.

  • IncomeRage IncomeRange The income range of the borrower at the time the listing was created.

A factor containing the levels “$0” “$1-24,999” “$25,000-49,999” “$50,000-74,999” “$75,000-99,999”
“$100,000+” “Not employed” “Not displayed”

  • IncomeVerifiable Most borrowers marked their income as verifiable.

  • StatedMonthlyIncome StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.

75 % stated an income of at leas 3,200 USD per month.

  • DeptToIncomeRatio

  • Recommendations Recommendations: Number of recommendations the borrower had at the time the listing was created.

Only 4,257 borrowers had Recommendations at all.

  • InvestmentFromFriendsCount Only 2,131 borrowers hand friends investing in them through the platform.

  • Investors The number of investors that funded the loan.

The listings were pulled from the Prosper database between November 2005 and March 2014. Double entries were removed from the dataset.

What is/are the main feature(s) of interest in your dataset?

The main interest is in these two features:

  • LoanStatus
  • BorrowerYield

I try to find out which features can predict if the loan will be payed back and what features influence the yield of the loan.

What other features in the dataset do you think will help support your

investigation into your feature(s) of interest?

The income, occupation, employment status will likely influence the success and the yield of the loan.

Did you create any new variables from existing variables in the dataset?

I created a simplified factor for the loan status only looking at Open, Successful, Not successful. Overdue payments were counted as Open and defaulted and written off were both counted as Not successful.

I created a new factor only distinguishing between employed or not employed.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the

form of the data? If so, why did you do this?

There were a few hundred double entries that were cleaned. These were inspected and removed.

Bivariate Plots Section

I’m first looking at the difference between the group of successful loans and the loans that were not payed back. The table shows several differences.

StatedIncome and Success

## # A tibble: 2 × 7
##   LoanStatus.Bucket mean_stated_income mean_loan_amount mean_borrower_rate
##              <fctr>              <dbl>            <dbl>              <dbl>
## 1    Not successful           4450.824         6424.842          0.2317624
## 2        successful           5324.390         6188.146          0.1864263
## # ... with 3 more variables: mean_debt_to_inc_ratio <dbl>,
## #   mean_duration_employment <dbl>, n <int>

There is a strong correlation between the success of an loan an the stated monthly income. The difference between the groups is about - 666 USD in income per month.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  StatedMonthlyIncome by LoanStatus.Bucket
## W = 272290000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -717.3333 -625.0000
## sample estimates:
## difference in location 
##              -666.6667
## # A tibble: 2 × 4
##   LoanStatus.Bucket mean_income median_income     n
##              <fctr>       <dbl>         <dbl> <int>
## 1    Not successful    4450.824      3750.000 17010
## 2        successful    5324.390      4416.667 38061

Amount and Success

The amount of the loan follows a similar distribution for the successful and not successful loans. The mean amount for the failed is a bit higher. At an alpha level of 0.05 the difference would still be significant. The effect is very low. This might change looking only at specific parts. All loans above 25,000 USD were always payed back.

## # A tibble: 2 × 5
##   LoanStatus.Bucket mean_amount median_amount max_amount min_amount
##              <fctr>       <dbl>         <dbl>      <int>      <int>
## 1    Not successful    6424.842          4500      25000       1000
## 2        successful    6188.146          4500      35000       1000
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  LoanOriginalAmount by LoanStatus.Bucket
## W = 327860000, p-value = 0.01594
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  6.215555e-05 1.190726e-05
## sample estimates:
## difference in location 
##           6.037318e-05

Term and Success

The short 12 month loans seems the be very save with a success rate of nearly 95 %. The difference between the 3 year and the 5 year loan is relatively low.

## # A tibble: 3 × 3
##    Term success_ratio     n
##   <int>         <dbl> <int>
## 1    12     0.9464752  1532
## 2    36     0.6859189 49847
## 3    60     0.6554713  3692

BorrowerRate and Success

The borrower rate for the successful loans is a lot lower than for the not successful loans. The average rate is 4.95 % lower with the successful loans.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  BorrowerRate by LoanStatus.Bucket
## W = 4.3e+08, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  0.04767575 0.05000017
## sample estimates:
## difference in location 
##             0.04948877

Employment and Success

As expected, the employed borrowers tend to pay back their loan more often, although the failure rate for the unemployed is still below 50 %.

## # A tibble: 2 × 3
##     IsEmployed success_ratio     n
##         <fctr>         <dbl> <int>
## 1     Employed     0.7110333 45417
## 2 Not employed     0.5974726  9654
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(ld.without_current$LoanStatus.Bucket, ld.without_current$IsEmployed)
## X-squared = 480.44, df = 1, p-value < 2.2e-16

Investors and Success

The successful loans had more investors on average. This might point to some sort of social network effect, people warning each other or similar.

## # A tibble: 2 × 4
##   LoanStatus.Bucket max_investors mean_investors median_investors
##              <fctr>         <int>          <dbl>            <dbl>
## 1    Not successful           881       97.45256               60
## 2        successful          1189      108.25094               74
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Investors by LoanStatus.Bucket
## W = 297990000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -9.999995 -7.000002
## sample estimates:
## difference in location 
##              -8.000034

InvestmentFromFriends and Success

Loans with investment from friends were more likely to be successful. Most people did not have investment from friends.

## # A tibble: 2 × 4
##   LoanStatus.Bucket  max_      mean_ median_
##              <fctr> <int>      <dbl>   <dbl>
## 1    Not successful     9 0.03386243       0
## 2        successful    33 0.05160138       0
## # A tibble: 2 × 3
##   HasInvestmentFromFriends success_ratio     n
##                      <lgl>         <dbl> <int>
## 1                    FALSE     0.6884164 53058
## 2                     TRUE     0.7625435  2013

ProsperScore and Success

As expected, the success rate becomes better with a higher ProsperScore. The loans without a prosper score were removed as outliers from the following plot.

## # A tibble: 12 × 3
##    ProsperScore success_ratio     n
##          <fctr>         <dbl> <int>
## 1             1     0.5562806   613
## 2             2     0.6542923  1293
## 3             3     0.6875399  1565
## 4             4     0.6692308  2730
## 5             5     0.6734694  3283
## 6             6     0.7215318  3943
## 7             7     0.7470644  3151
## 8             8     0.8216659  4514
## 9             9     0.8818040  3215
## 10           10     0.9534180  1653
## 11           11     0.9687500    32
## 12           NA     0.6331029 29079

Listing and Success

The borrower could state a purpose for the loan. Only looking at categories with n > 50, Motorcycle was the best. No specification is a risk since these were the least reliable loans. The most common category Debt Restructuring still had a slightly better than average rate.

## # A tibble: 16 × 3
##    ListingCategory..names. success_ratio     n
##                      <chr>         <dbl> <int>
## 1               Motorcycle     0.9148936    94
## 2          Engagement Ring     0.9047619    63
## 3                     Auto     0.7862248  1539
## 4                    Other     0.7531457  6040
## 5         Home Improvement     0.7502313  3243
## 6                    Taxes     0.7476636   214
## 7          Large Purchases     0.7458564   181
## 8              Student Use     0.7420635   756
## 9       Debt Consolidation     0.7368126 17858
## 10                Vacation     0.7323232   198
## 11           Wedding Loans     0.7309942   171
## 12           Personal Loan     0.6764092  2395
## 13                Business     0.6689380  4256
## 14          Medical/Dental     0.6459459   370
## 15      Household Expenses     0.6214405   597
## 16           Not Available     0.6049681 16948

Occupation and Success

Only looking at professions who had at least 100 loans, Military Officers seem to be the most reliable.

## # A tibble: 49 × 4
##                            Occupation success_ratio fail_ratio     n
##                                <fctr>         <dbl>      <dbl> <int>
## 1                    Military Officer     0.9004975 0.09950249   201
## 2                            Attorney     0.8868360 0.11316397   433
## 3  Student - College Graduate Student     0.8834951 0.11650485   206
## 4                           Scientist     0.8763441 0.12365591   186
## 5                 Computer Programmer     0.8311949 0.16880513  2494
## 6                             Analyst     0.8165447 0.18345526  1777
## 7                              Doctor     0.8122066 0.18779343   213
## 8                 Engineer - Chemical     0.8000000 0.20000000   110
## 9               Engineer - Electrical     0.8000000 0.20000000   545
## 10           Student - College Senior     0.7878788 0.21212121   165
## # ... with 39 more rows

The Military Officers also payed less interest.

## [1] "All borrowers:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1350  0.1945  0.2004  0.2699  0.4975
## [1] "Only Military Officers:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0425  0.0920  0.1500  0.1708  0.2400  0.3500

Homeowner and Success

The homeowners were a bit more likely to pay back the loan but not by much. The p value is still very low so the difference is still significant.

## # A tibble: 2 × 4
##   IsBorrowerHomeowner success_ratio fail_ratio     n
##                <fctr>         <dbl>      <dbl> <int>
## 1               False     0.6778117  0.3221883 29191
## 2                True     0.7061437  0.2938563 25880
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(ld.without_current$IsBorrowerHomeowner, ld.without_current$LoanStatus.Bucket)
## X-squared = 51.451, df = 1, p-value = 7.342e-13

BorrowerRate

The interest rate (BorrowerRate) is higher for small loans. This was not expected as the mean amount for the unsuccessful loans was higher than for the successful therefore smaller loans seem to be less risky.

## 
##  Pearson's product-moment correlation
## 
## data:  ld$BorrowerRate and ld$LoanOriginalAmount
## t = -116.89, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3335464 -0.3231454
## sample estimates:
##        cor 
## -0.3283558

Borrowers who stated a higher monthly income payed less interest. The correlations is not very strong.

## 
##  Pearson's product-moment correlation
## 
## data:  ld$BorrowerRate and ld$StatedMonthlyIncome
## t = -29.95, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09450222 -0.08293603
## sample estimates:
##         cor 
## -0.08872211

In plot shows that the interest rate goes down with higher monthly payments. This is not expected since higher interest should lead to higher monthly payments. In the plot some steep “stripes” are visible that represent the correlation that is expected. These stripes should represent loans of the same amount and duration that just differ by the interest rate.

## 
##  Pearson's product-moment correlation
## 
## data:  ld$BorrowerRate and ld$MonthlyLoanPayment
## t = -84.592, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2494531 -0.2384891
## sample estimates:
##        cor 
## -0.2439789

The not successful loans usually had to pay an higher interest rate.

The unemployed had to pay higher interest on their loan.

Doctors pay the lowest interest on average. The Military Officer who had the best reliability only makes 13th place.

## # A tibble: 62 × 3
##                    Occupation mean_interest     n
##                        <fctr>         <dbl> <int>
## 1                      Doctor     0.1608967   492
## 2                  Pharmacist     0.1640706   255
## 3         Engineer - Chemical     0.1675559   222
## 4         Computer Programmer     0.1680567  4442
## 5                    Attorney     0.1684318  1033
## 6  Pilot - Private/Commercial     0.1688667   198
## 7       Engineer - Electrical     0.1692419  1118
## 8                   Professor     0.1707583   556
## 9                   Scientist     0.1710580   367
## 10                  Architect     0.1719108   212
## # ... with 52 more rows

No. of Investors and Success

The plot shows how the successful and not successful loans are connected through the numbers of investors. More investors mean a better chance of success.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other features

in the dataset?

I found a relationship between the success that a loan will be payed back or not with several features. A higher ProsperScore was an indicator, as well as the StatedMonthlyIncome, the EmploymentStatus, the Occupation and the BorrowerRate. The Term (length) of the loan was most interesting for the (few) 1 year loans. The amount of the loan also seems to be not highly connected.

The BorrowerRate (interest rate the borrower had to pay), vary with the ProsperScore, the StatedMonthlyIncome and the EmploymentStatus. The interest is also lower for loans with a higher amount. The rate also varies with the occupation.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

The StatedMonthlyIncome and the LoanAmount are correlated. The stated monthly income for Loans greater than 25,000 USD was equal or above 8,333 USD a month. All lower loans also had borrowers without a stated income (or an stated income of 0 USD). Higher loans also mostly stated DeptRestruturing as the main purpose. This is surprising, as a high monthly income should secure other loans as well.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8333    9583   11600   13340   14580   66670

## 
##  Pearson's product-moment correlation
## 
## data:  ld$StatedMonthlyIncome and ld$LoanOriginalAmount
## t = 68.854, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1950122 0.2062010
## sample estimates:
##       cor 
## 0.2006131

What was the strongest relationship you found?

The strongest relationships are those obviously needed by the business model. Higher loans correspond with higher monthly payments, as well as higher interest rates correspond with higher monthly payments.

Besides this, the 12 Month Term and the ProsperScore of 10 had a very strong relationship with the success of the loan.

Multivariate Plots Section

Term, Monthly Payment and Amount

The following plot shows the relationship between the monthly payment and the amount dependent on the term of the loan. There is a difference in the relationship starting at loans grater than 25,000 USD. The width of the distribution is narrowing. Like in the chart plotting the loan amount towards the interest rate there is less variance after 25,000 USD.

Rate, Income and LoanAmount

The following plot shows the distribution of the loans by the amount and the stated monthly income. The color indicates the rate the borrower had to pay. The vertical stripes show the clusters around even amounts. The lighter colors in the “Not successful” facet show how the lower amounts and lower incomes were paying higher interest.

Income, BorrowerRate, Success

The following plot shows the distribution of the stated income and the rate the borrower had to pay. The two facets show the distribution for the successful and the not successful loans. The alpha level was adjusted to account for the different sizes of the facets. The successful facet shows a thick cluster at a rate of 0.1 where there is only few in the not successful facet.

Borrower Rate, Prosper Score and Success

The plot shows how the BorrowerRate, the ProsperScore and the success of the loan correspond. The higher chance of success is easily spotted as well at the lower interest rate borrowers had to pay with a higher prosper score.

Employment, Categories and Success

The plot shows the strong effect of the employment of the borrower and how this distribution varies for different listing categories. The plot shows that Debt Restructuring was mostly stated by employed, while unemployed often also did not specify a listing category.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. Were there features that strengthened each other in terms of

looking at your feature(s) of interest?

Employment and DebtRestructuring

While the employment correlates with a better success chance and the purpose DeptRestructuring also had a better chance of success this was partly reversed when combining the two features. DeptRestructuring lead to an even better success chance for the employed, but lowered the chance for the unemployed.

## # A tibble: 2 × 3
##     IsEmployed success_ratio     n
##         <fctr>         <dbl> <int>
## 1     Employed     0.7110333 45417
## 2 Not employed     0.5974726  9654
## # A tibble: 2 × 3
##   IsDeptRestructuring success_ratio     n
##                 <lgl>         <dbl> <int>
## 1               FALSE     0.6692016 37213
## 2                TRUE     0.7368126 17858
## Source: local data frame [4 x 4]
## Groups: IsEmployed [?]
## 
##     IsEmployed IsDeptRestructuring success_ratio     n
##         <fctr>               <lgl>         <dbl> <int>
## 1     Employed               FALSE     0.6916209 28225
## 2     Employed                TRUE     0.7429037 17192
## 3 Not employed               FALSE     0.5987984  8988
## 4 Not employed                TRUE     0.5795796   666

ProsperScore and StatedMonthlyIncome

A prosper score and a higher monthly income correlate with a higher success chance. Combined the income is higher at ProsperScore 10. Even so at lower scores the difference in income for the successful and unsuccessful is bigger. Furthermore, there isn’t a big difference in lower ProsperScores.


Final Plots and Summary

Plot One: Borrower Rate, Prosper Score and Success

Description One

The first plot shows how the BorrowerRate, the ProsperScore and the success of the loan correspond. The higher chance of success is easily spotted as well at the lower interest rate borrowers had to pay with a higher prosper score. Not including the few instances in score 11, the score of 10 had the best chance.

Plot Two: Employment, Categories and Success

Description Two

The second plot shows the strong effect of the employment of the borrower and how this distribution varies for different listing categories. The plot shows that Debt Restructuring was mostly stated by employed, while unemployed often also did not specify a listing category.

Plot Three

Description Three

The third plot shows how the successful and not successful loans are connected through the numbers of investors. More investors mean a better chance of success. The unsuccessful loans had on average 8 investors less than the successful.


Reflection

I was able to look at different features in the dataset. Many features are significant for the determination if the loan will be payed back or not but mostly the effect strength is small. Some features have a very high influence but the feature is only limited to few instances and therefore often not of great help. While loans with a term of 12 month were very successful, only few loans were in this subgroup.

The problem of determining whether a borrower will pay back it’s loan or not remains difficult. The most widely available indicator for a success was if the borrower is employed.

Problematic with the dataset was also that most personal data are statements by the borrowers. They might made mistakes or lied.

Furthermore, some of the features I analyzed are only interesting in hindsight, since a typical question would be to determine the credit worthiness of a new borrower to determine the adequate interest rate.

Future questions might involve using some of the features to predict the success of the loan not yet closed. Besides this we might want to use a prediction to fill in the blanks for some features, like the occupation of a borrower.